Lexicalized Stochastic Modeling of Constraint-Based 
Grammars using Log-Linear Measures and EM Training 
Abstract

We present a new approach to stochastic modeling of constraint-based
grammars that is based on log-linear models and uses EM for estimation
from unannotated data. The techniques are applied to an LFG grammar for
German. Evaluation on an exact match task yields 86% precision for an
ambiguity rate of 5.4, and 90% precision on a subcat frame match for an
ambiguity rate of 25. Experimental comparison to training from a
parsebank shows a 10% gain from EM training. Also, a new class-based
grammar lexicalization is presented, showing a 10% gain over
unlexicalized models.

1 Introduction

Stochastic parsing models capturing contextual constraints beyond the
dependencies of probabilistic context-free grammars (PCFGs) are
currently the subject of intensive research. An interesting feature
common to most such models is the incorporation of contextual
dependencies on individual head words into rule-based probability
models. Such word-based lexicalizations of probability models are used
successfully in the statistical parsing models of, e.g., Collins (1997),
Charniak (1997), or Ratnaparkhi (1997). However, it is still an open
question which kind of lexicalization, e.g., statistics on individual
words or statistics based upon word classes, is the best choice.
Secondly, these approaches have in common the fact that the probability
models are trained on treebanks, i.e., corpora of manually disambiguated
sentences, and not from corpora of unannotated sentences. In all of the
cited approaches, the Penn Wall Street Journal Treebank (Marcus et al.,
1993) is used, the availability of which obviates the standard effort
required for treebank training: hand-annotating large corpora of
specific domains of specific languages with specific parse types.
Moreover, common wisdom is that training from unannotated data via the
expectation-maximization (EM) algorithm (Dempster et al., 1977) yields
poor results unless at least partial annotation is applied. Experimental
results confirming this wisdom have been presented, e.g., by Elworthy
(1994) and Pereira and Schabes (1992) for EM training of Hidden Markov
Models and PCFGs.

In this paper, we present a new lexicalized 
stochastic model for constraint-based gram- 
mars that employs a combination of head- 
word frequencies and EM-based clustering 
for grammar lexicalization. Furthermore, we 
make crucial use of EM for estimating the 
parameters of the stochastic grammar from 
unannotated data. Our use of EM was motivated
by the current lack of large unification-based
treebanks for German. However, our
experimental results also show an exception 
to the common wisdom of the insufficiency of 
EM for highly accurate statistical modeling. 

Our approach to lexicalized stochastic 
modeling is based on the parametric family of 
log-linear probability models, which is used to 
define a probability distribution on the parses 
of a Lexical-Functional Grammar (LFG) for 
German. In previous work on log-linear models for LFG by
Johnson et al. (1999), pseudo-likelihood estimation from
annotated corpora has been introduced and experimented with
on a small scale. However, to our knowledge, to date no large
LFG annotated corpora of unrestricted German text are
available. Fortunately, algorithms exist for statistical
inference of log-linear models from unannotated data
(Riezler, 1999). We apply this algorithm
to estimate log-linear LFG models from large 
corpora of newspaper text. In our largest ex- 
periment, we used 250,000 parses which were 
produced by parsing 36,000 newspaper sen- 
tences with the German LFG. Experimental 
evaluation of our models on an exact-match 
task (i.e. percentage of exact match of most 
probable parse with correct parse) on 550 
manually examined examples with on aver- 
age 5.4 analyses gave 86% precision. Another 
evaluation on a verb frame recognition task 
(i.e. percentage of agreement between subcat- 
egorization frames of main verb of most prob- 
able parse and correct parse) gave 90% pre- 
cision on 375 manually disambiguated exam- 
ples with an average ambiguity of 25. Clearly, 
a direct comparison of these results to state- 
of-the-art statistical parsers cannot be made 
because of different training and test data 
and other evaluation measures. However, we 
would like to draw the following conclusions 
from our experiments: 

• The problem of chaotic convergence be- 
haviour of EM estimation can be solved 
for log-linear models. 

• EM does help constraint-based gram- 
mars, e.g. using about 10 times more sen- 
tences and about 100 times more parses 
for EM training than for training from 
an automatically constructed parsebank 
can improve precision by about 10%. 

• Class-based lexicalization can yield a 
gain in precision of about 10%. 

In the rest of this paper we introduce incomplete-data
estimation for log-linear models (Sec. 2), present the actual
design of our models (Sec. 3), and report our experimental
results (Sec. 4).



2 Incomplete-Data Estimation for 
Log-Linear Models 

2.1 Log-Linear Models 

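A log-linear distribution $p_\lambda(x)$ on the set of analyses
$\mathcal{X}$ of a constraint-based grammar can be defined as follows:

$$p_\lambda(x) = Z_\lambda^{-1} \, e^{\lambda \cdot \nu(x)} \, p_0(x)$$

where $Z_\lambda = \sum_{x \in \mathcal{X}} e^{\lambda \cdot \nu(x)} p_0(x)$
is a normalizing constant, $\lambda = (\lambda_1, \ldots, \lambda_n) \in \mathbb{R}^n$
is a vector of log-parameters, $\nu = (\nu_1, \ldots, \nu_n)$ is a vector
of property-functions $\nu_i : \mathcal{X} \to \mathbb{R}$ for
$i = 1, \ldots, n$, $\lambda \cdot \nu(x)$ is the vector dot product
$\sum_{i=1}^{n} \lambda_i \nu_i(x)$, and $p_0$ is a fixed reference
distribution.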

The task of probabilistic modeling with log-linear distributions is to
build salient properties of the data as property-functions $\nu_i$ into
the probability model. For a given vector $\nu$ of property-functions,
the task of statistical inference is to tune the parameters $\lambda$ to
best reflect the empirical distribution of the training data.
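As an illustration of the definition in Sec. 2.1, the following minimal sketch evaluates a log-linear distribution over a small finite set of analyses. The function name `log_linear`, the three toy analyses and their property values are assumptions made for this example only; they are not taken from the grammar or the experiments.

```python
import math

def log_linear(lambdas, features, p0):
    """Return p_lambda(x) for every analysis x in a finite set.

    features[x] is the property vector nu(x); p0[x] is the reference
    probability of analysis x. All names here are illustrative.
    """
    weights = [
        math.exp(sum(l * f for l, f in zip(lambdas, features[x]))) * p0[x]
        for x in range(len(p0))
    ]
    z = sum(weights)  # normalizing constant Z_lambda
    return [w / z for w in weights]

# Toy example: three analyses, two properties, uniform reference model.
features = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
p0 = [1 / 3, 1 / 3, 1 / 3]
probs = log_linear((0.5, -0.5), features, p0)
```

With these toy values, the analysis that has only the positively weighted property receives the highest probability, and the one that has only the negatively weighted property the lowest.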

2.2 Incomplete-Data Estimation 

Standard numerical methods for statistical inference of log-linear
models from fully annotated data (so-called complete data) are the
iterative scaling methods of Darroch and Ratcliff (1972) and
Della Pietra et al. (1997). For data consisting of unannotated sentences
(so-called incomplete data) the iterative method of the EM algorithm
(Dempster et al., 1977) has to be employed. However, since even
complete-data estimation for log-linear models requires iterative
methods, an application of EM to log-linear models results in an
algorithm which is expensive since it is doubly-iterative. A
singly-iterative algorithm interleaving EM and iterative scaling into a
mathematically well-defined estimation method for log-linear models from
incomplete data is the IM algorithm of Riezler (1999). Applying this
algorithm to stochastic constraint-based grammars, we assume the
following to be given: a training sample of unannotated sentences $y$
from a set $\mathcal{Y}$, observed with empirical probability
$\tilde{p}(y)$, a constraint-based grammar yielding a set $X(y)$ of
parses for each sentence $y$, and a log-linear model $p_\lambda(\cdot)$
on the parses $\mathcal{X} = \sum_{y \in \mathcal{Y} \mid \tilde{p}(y) > 0} X(y)$
for the sentences in the training corpus, with known values of
property-functions $\nu$ and unknown values of $\lambda$. The aim of
incomplete-data maximum likelihood estimation (MLE) is to find a value
$\lambda^*$ that maximizes the incomplete-data log-likelihood $L$:

$$\lambda^* = \arg\max_{\lambda \in \mathbb{R}^n} L(\lambda).$$

Closed-form parameter-updates for this problem can be computed by the
algorithm of Fig. 1, where $\nu_\#(x) = \sum_{i=1}^{n} \nu_i(x)$ and
$k_\lambda(x|y) = p_\lambda(x) / \sum_{x' \in X(y)} p_\lambda(x')$ is the
conditional probability of a parse $x$ given the sentence $y$ and the
current parameter value $\lambda$.

Input: Reference model $p_0$, property-functions vector $\nu$ with
  constant $\nu_\#$, parses $X(y)$ for each $y$ in incomplete-data sample
  from $\mathcal{Y}$.

Output: MLE model $p_{\lambda^*}$ on $\mathcal{X}$.

Procedure:
  Until convergence do
    Compute $p_\lambda$, $k_\lambda$, based on $\lambda = (\lambda_1, \ldots, \lambda_n)$,
    For $i$ from 1 to $n$ do
      $\gamma_i := \frac{1}{\nu_\#} \ln \frac{\sum_{y \in \mathcal{Y}} \tilde{p}(y) \sum_{x \in X(y)} k_\lambda(x|y) \nu_i(x)}{\sum_{x \in \mathcal{X}} p_\lambda(x) \nu_i(x)}$,
      $\lambda_i := \lambda_i + \gamma_i$,
  Return $\lambda^* = (\lambda_1, \ldots, \lambda_n)$.

Figure 1: Closed-form version of IM algorithm

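The constancy requirement on $\nu_\#$ can be enforced by adding a
"correction" property-function $\nu_l$:

Choose $K = \max_{x \in \mathcal{X}} \nu_\#(x)$ and
$\nu_l(x) = K - \nu_\#(x)$ for all $x \in \mathcal{X}$.
Then $\nu_\#(x) + \nu_l(x) = K$ for all $x \in \mathcal{X}$.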

Note that because of the restriction of $\mathcal{X}$ to the parses
obtainable by a grammar from the training corpus, we have a log-linear
probability measure only on those parses and not on all possible parses
of the grammar. We shall therefore speak of mere log-linear measures in
our application of disambiguation.
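The closed-form updates of Sec. 2.2 can be sketched in a few lines of code. The sketch below is an illustration under stated assumptions, not the implementation used in the experiments: the update is the scaling step with the expectation under $\tilde{p}(y) k_\lambda(x|y)$ in the numerator and the model expectation in the denominator; the incomplete-data log-likelihood is assumed to be the standard $\sum_y \tilde{p}(y) \ln \sum_{x \in X(y)} p_\lambda(x)$; properties are assumed non-negative; and all function names and the five-parse toy corpus are hypothetical.

```python
import math

def add_correction(features):
    """Append nu_l(x) = K - nu_#(x) so that every row sums to the constant K."""
    k = max(sum(row) for row in features)
    return [list(row) + [k - sum(row)] for row in features], k

def model(lambdas, features, p0):
    """Log-linear measure p_lambda over all parses of the training corpus."""
    w = [math.exp(sum(l * f for l, f in zip(lambdas, row))) * p0[x]
         for x, row in enumerate(features)]
    z = sum(w)
    return [wx / z for wx in w]

def log_likelihood(lambdas, features, p0, sentences, p_emp):
    """Assumed objective: sum_y p~(y) * ln sum_{x in X(y)} p_lambda(x)."""
    p = model(lambdas, features, p0)
    return sum(py * math.log(sum(p[x] for x in parses))
               for parses, py in zip(sentences, p_emp))

def im_step(lambdas, features, k, p0, sentences, p_emp):
    """One pass of closed-form updates over all parameters."""
    p = model(lambdas, features, p0)
    updated = []
    for i, lam in enumerate(lambdas):
        # Numerator: expectation of nu_i under p~(y) * k_lambda(x | y).
        num = 0.0
        for parses, py in zip(sentences, p_emp):
            zy = sum(p[x] for x in parses)
            num += py * sum(p[x] / zy * features[x][i] for x in parses)
        # Denominator: expectation of nu_i under the current model.
        den = sum(px * row[i] for px, row in zip(p, features))
        updated.append(lam + math.log(num / den) / k)
    return updated

# Hypothetical toy corpus: sentence 0 has parses 0-1, sentence 1 has parses 2-4.
raw = [(1, 0), (0, 1), (1, 0), (0, 1), (1, 1)]
features, k = add_correction(raw)
sentences = [[0, 1], [2, 3, 4]]
p_emp = [0.45, 0.55]
p0 = [0.2] * 5
lambdas = [0.0] * len(features[0])  # uniform start: p_lambda = p0
history = [log_likelihood(lambdas, features, p0, sentences, p_emp)]
for _ in range(50):
    lambdas = im_step(lambdas, features, k, p0, sentences, p_emp)
    history.append(log_likelihood(lambdas, features, p0, sentences, p_emp))
```

In this toy setting the recorded log-likelihood values in `history` should not decrease from one iteration to the next, which is the behaviour the convergence result cited from Riezler (1999) leads one to expect.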

2.3 Searching for Order in Chaos 

For incomplete-data estimation, a sequence of likelihood values is
guaranteed to converge to a critical point of the likelihood function
$L$. This is shown for the IM algorithm in Riezler (1999). The process
of finding likelihood maxima is chaotic in that the final likelihood
value is extremely sensitive to the starting values of $\lambda$, i.e.
limit points can be local maxima (or saddlepoints), which are not
necessarily also global maxima. A way to search for order in this chaos
is to search for starting values which are hopefully attracted by the
global maximum of $L$. This problem can best be explained in terms of
the minimum divergence paradigm (Kullback, 1959), which is equivalent to
the maximum likelihood paradigm by the following theorem. Let
$p[f] = \sum_{x \in \mathcal{X}} p(x) f(x)$ be the expectation of a
function $f$ with respect to a distribution $p$:

    The probability distribution $p^*$ that minimizes the divergence
    $D(p \| p_0)$ to a reference model $p_0$ subject to the constraints
    $p[\nu_i] = q[\nu_i]$, $i = 1, \ldots, n$ is the model in the
    parametric family of log-linear distributions $p_\lambda$ that
    maximizes the likelihood $L(\lambda) = q[\ln p_\lambda]$ of the
    training data.¹

¹ If the training sample consists of complete data



A reasonable choice of starting values for minimum divergence
estimation is to set $\lambda_i = 0$ for $i = 1, \ldots, n$. This yields
a distribution which minimizes the divergence to $p_0$, over the set of
models $p$ to which the constraints $p[\nu_i] = q[\nu_i]$,
$i = 1, \ldots, n$ have yet to be applied. Clearly, this argument
applies to both complete-data and incomplete-data estimation. Note that
for a uniformly distributed reference model $p_0$, the minimum
divergence model is a maximum entropy model (Jaynes, 1957). In Sec. 4 we
will demonstrate that a uniform initialization of the IM algorithm shows
a significant improvement in likelihood maximization as well as in
linguistic performance when compared to standard random initialization.
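The effect of starting with all parameters at zero can be checked directly from the definition of the log-linear distribution in Sec. 2.1. The short derivation below is a worked illustration, assuming that $p_0$ is a normalized distribution on $\mathcal{X}$:

```latex
p_{\lambda = 0}(x)
  = \frac{e^{0 \cdot \nu(x)} \, p_0(x)}{\sum_{x' \in \mathcal{X}} e^{0 \cdot \nu(x')} \, p_0(x')}
  = \frac{p_0(x)}{\sum_{x' \in \mathcal{X}} p_0(x')}
  = p_0(x),
\qquad
D(p_{\lambda = 0} \,\|\, p_0) = D(p_0 \,\|\, p_0) = 0 .
```

The starting model therefore coincides with the reference model, which has zero divergence to itself; for a uniform $p_0$ this is the uniform distribution over the parses.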

3 Property Design and 
Lexicalization 

3.1 Basic Configurational Properties 

The 190 basic properties employed in our models are similar to the
properties of Johnson et al. (1999) which incorporate general linguistic
principles into a log-linear model. They refer to both the
c(onstituent)-structure and the f(eature)-structure of the LFG parses.
Examples are properties for

• c-structure nodes, corresponding to stan- 
dard production properties, 

• c-structure subtrees, indicating argu- 
ment versus adjunct attachment, 

• f-structure attributes, corresponding to 
grammatical functions used in LFG, 

• atomic attribute-value pairs in f- 
structures, 

• complexity of the phrase being attached 
to, thus indicating both high and low at- 
tachment, 

• non-right-branching behavior of nonter- 
minal nodes, 

• non-parallelism of coordinations. 



$x \in \mathcal{X}$, the expectation $q[\cdot]$ corresponds to the
empirical expectation $\tilde{p}[\cdot]$. If we observe incomplete data
$y \in \mathcal{Y}$, the expectation $q[\cdot]$ is replaced by the
conditional expectation $\tilde{p}[k_{\lambda'}[\cdot]]$ given the
observed data $y$ and the current parameter value $\lambda'$.



3.2 Class-Based Lexicalization 

Our approach to grammar lexicalization is class-based in the sense that
we use class-based estimated frequencies $f_c(v, n)$ of head-verbs $v$
and argument head-nouns $n$ instead of pure frequency statistics or
class-based probabilities of head word dependencies. Class-based
estimated frequencies are introduced in Prescher et al. (2000) as the
frequency $f(v, n)$ of a $(v, n)$-pair in the training corpus, weighted
by the best estimate of the class-membership probability $p(c|v, n)$ of
an EM-based clustering model on $(v, n)$-pairs, i.e.,

$$f_c(v, n) = \max_{c \in C} p(c|v, n) \, (f(v, n) + 1).$$

As is shown in Prescher et al. (2000) in an evaluation on lexical
ambiguity resolution, a gain of about 7% can be obtained by using the
class-based estimated frequency $f_c(v, n)$ as disambiguation criterion
instead of class-based probabilities $p(n|v)$. In order to make the most
direct use possible of this fact, we incorporated the decisions of the
disambiguator directly into 45 additional properties for the
grammatical relations of the subject, direct object, indirect object,
infinitival object, oblique and adjunctival dative and accusative
preposition, for active and passive forms of the first three verbs in
each parse. Let $v_r(x)$ be the verbal head of grammatical relation $r$
in parse $x$, and $n_r(x)$ the nominal head of grammatical relation $r$
in $x$. Then a lexicalized property $\nu_r$ for grammatical relation $r$
is defined as

$$\nu_r(x) = \begin{cases} 1 & \text{if } f_c(v_r(x), n_r(x)) \geq f_c(v_r(x'), n_r(x')) \ \forall x' \in X(y), \\ 0 & \text{otherwise.} \end{cases}$$

The property-function $\nu_r$ thus pre-disambiguates the parses
$x \in X(y)$ of a sentence $y$ according to $f_c(v, n)$, and stores the
best parse directly instead of taking the actual estimated frequencies
as its value. In Sec. 4, we will see that an incorporation of this
pre-disambiguation routine into the models improves performance in
disambiguation by about 10%.
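The pre-disambiguation step of Sec. 3.2 can be illustrated with a small sketch. It assumes that the lexicalized property takes the value 1 on the parses whose head pair reaches the maximal class-based estimated frequency and 0 elsewhere; the function names, the dictionary-based stand-in for the clustering model, and the toy frequencies and class posteriors are all hypothetical.

```python
def class_based_frequency(freq, class_posterior, v, n):
    """f_c(v, n) = max_c p(c | v, n) * (f(v, n) + 1).

    freq maps (verb, noun) pairs to corpus counts; class_posterior maps
    (verb, noun) pairs to a list of class-membership probabilities.
    """
    best = max(class_posterior[(v, n)])
    return best * (freq.get((v, n), 0) + 1)

def lexicalized_property(heads, freq, class_posterior):
    """Indicator values of the property for all parses of one sentence.

    heads[x] is the pair (v_r(x), n_r(x)) for parse x and a fixed
    grammatical relation r.
    """
    scores = [class_based_frequency(freq, class_posterior, v, n)
              for v, n in heads]
    top = max(scores)
    # Ties are kept: every parse reaching the maximum is marked with 1.
    return [1 if s >= top else 0 for s in scores]

# Invented toy data for two competing parses of one sentence.
freq = {("essen", "Apfel"): 9}
posteriors = {("essen", "Apfel"): [0.9, 0.1],
              ("essen", "Tisch"): [0.5, 0.5]}
heads = [("essen", "Apfel"), ("essen", "Tisch")]
values = lexicalized_property(heads, freq, posteriors)
```

Because only the indicator is stored, the log-linear model sees the decision of the disambiguator rather than the magnitude of the estimated frequencies.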

4 Experiments 
Figure 2: Evaluation on exact match task for 550 examples with average ambiguity 5.4

Figure 3: Evaluation on frame match task for 375 examples with average ambiguity 25



4.1 Incomplete Data and Parsebanks 

In our experiments, we used an LFG grammar for German² for parsing
unrestricted text. Since training was faster than parsing, we parsed in
advance and stored the resulting packed c/f-structures. The low
ambiguity rate of the German LFG grammar allowed us to restrict the
training data to sentences with at most 20 parses. The resulting
training corpus of unannotated, incomplete data consists of
approximately 36,000 sentences of German newspaper text available
online, comprising approximately 250,000 parses.

In order to compare the contribution of unambiguous and ambiguous
sentences to the estimation results, we extracted a subcorpus of 4,000
sentences, for which the LFG grammar produced a unique parse, from the
full training corpus. The average sentence length of 7.5 for this
automatically constructed parsebank is only slightly smaller than that
of 10.5 for the full set of 36,000 training sentences and 250,000
parses. Thus, we conjecture that the parsebank includes a representative
variety of linguistic phenomena. Estimation from this automatically
disambiguated parsebank enjoys the same complete-data estimation
properties³ as training from manually disambiguated treebanks. This
makes a comparison of complete-data estimation from this parsebank to
incomplete-data estimation from the full set of training data
interesting.

² The German LFG grammar is being implemented in the Xerox Linguistic
Environment (XLE, see Maxwell and Kaplan (1996)) as part of the Parallel
Grammar (ParGram) project at the IMS Stuttgart. The coverage of the
grammar is about 50% for unrestricted newspaper text. For the
experiments reported here, the effective coverage was lower, since the
corpus preprocessing we applied was minimal. Note that for the
disambiguation task we were interested in, the overall grammar coverage
was of subordinate relevance.

4.2 Test Data and Evaluation Tasks 

To evaluate our models, we constructed two different test corpora. We
first used the LFG grammar to parse 550 sentences which are used for
illustrative purposes in the foreign language learner's grammar of
Helbig and Buscha (1996). In a next step, the correct parse was
indicated by a human disambiguator, according to the reading intended in
Helbig and Buscha (1996). Thus a precise indication of correct
c/f-structure pairs was possible. However, the average ambiguity of this
corpus is only 5.4 parses per sentence, for sentences with on average
7.5 words. In order to evaluate on sentences with higher ambiguity rate,
we manually disambiguated a further 375 sentences of LFG-parsed
newspaper text. The sentences of this corpus have on average 25 parses
and 11.2 words.

³ For example, convergence to the global maximum of the complete-data
log-likelihood function is guaranteed, which is a good condition for
highly precise statistical disambiguation.

We tested our models on two evalua- 
tion tasks. The statistical disambiguator was 
tested on an "exact match" task, where exact 
correspondence of the full c/f-structure pair 
of the hand-annotated correct parse and the 
most probable parse is checked. Another eval- 
uation was done on a "frame match" task, 
where exact correspondence only of the sub- 
categorization frame of the main verb of the 
most probable parse and the correct parse is 
checked. Clearly, the latter task involves a 
smaller effective ambiguity rate, and is thus 
to be interpreted as an evaluation of the com- 
bined system of highly-constrained symbolic 
parsing and statistical disambiguation. 

Performance on these two evaluation tasks was assessed according to the
following evaluation measures:

$$\text{Precision} = \frac{\#\text{correct}}{\#\text{correct} + \#\text{incorrect}}, \qquad \text{Effectiveness} = \frac{\#\text{correct}}{\#\text{correct} + \#\text{incorrect} + \#\text{don't know}}.$$

"Correct" and "incorrect" specify a success/failure on the respective
evaluation tasks; "don't know" cases are cases where the system is
unable to make a decision, i.e. cases with more than one most probable
parse.

4.3 Experimental Results

For each task and each test corpus, we calculated a random baseline by
averaging over several models with randomly chosen parameter values.
This baseline measures the disambiguation power of the pure symbolic
parser. The results of an exact-match evaluation on the Helbig-Buscha
corpus are shown in Fig. 2. The random baseline was around 33% for this
case. The columns list different models according to their
property-vectors. "Basic" models consist of 190 configurational
properties as described in Sec. 3.1. "Lexicalized" models are extended
by 45 lexical pre-disambiguation properties as described in Sec. 3.2.
"Selected + lexicalized" models result from a simple property selection
procedure where a cutoff on the number of parses with non-negative value
of the property-functions was set. Estimation of basic models from
complete data gave 68% precision (P), whereas training lexicalized and
selected models from incomplete data gave 86.1% precision, which is an
improvement of 18%. Comparing lexicalized models in the estimation
method shows that incomplete-data estimation gives an improvement of 12%
precision over training from the parsebank. A comparison of models
trained from incomplete data shows that lexicalization yields a gain of
13% in precision. Note also the gain in effectiveness (E) due to the
pre-disambiguation routine included in the lexicalized properties. The
gain due to property selection both in precision and effectiveness is
minimal. A similar pattern of performance arises in an exact match
evaluation on the newspaper corpus with an ambiguity rate of 25. The
lexicalized and selected model trained from incomplete data achieved
here 60.1% precision and 57.9% effectiveness, for a random baseline of
around 17%.
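The precision (P) and effectiveness (E) figures reported here follow from counts of correct, incorrect and undecided test items as defined in Sec. 4.2. The sketch below shows one way these counts and measures could be computed; the function names, the tie tolerance `tol`, and the example probabilities are assumptions for illustration and are not taken from the experiments.

```python
def judge(probs, gold, tol=1e-12):
    """Classify one test sentence given model probabilities of its parses.

    gold is the index of the correct parse. A tie for the most probable
    parse counts as a "don't know" case.
    """
    best = max(probs)
    winners = [i for i, p in enumerate(probs) if best - p <= tol]
    if len(winners) > 1:
        return "dont_know"
    return "correct" if winners[0] == gold else "incorrect"

def precision(correct, incorrect):
    """Share of correct decisions among all decisions actually made."""
    return correct / (correct + incorrect)

def effectiveness(correct, incorrect, dont_know):
    """Share of correct decisions among all test items."""
    return correct / (correct + incorrect + dont_know)

# Invented outcomes for three test sentences, each with gold parse 0.
outcomes = [judge([0.6, 0.4], 0), judge([0.5, 0.5], 0), judge([0.2, 0.8], 0)]
counts = {label: outcomes.count(label)
          for label in ("correct", "incorrect", "dont_know")}
```

Effectiveness can never exceed precision, and the two coincide exactly when there are no "don't know" cases, which is why a pre-disambiguation routine that breaks ties can raise effectiveness.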

As shown in Fig. 3, the improvement in per-
formance due to both lexicalization and EM
training is smaller for the easier task of frame 
evaluation. Here the random baseline is 70% 
for frame evaluation on the newspaper corpus 
with an ambiguity rate of 25. An overall gain 
of roughly 10% can be achieved by going from 
unlexicalized parsebank models (80.6% preci- 
sion) to lexicalized EM-trained models (90% 
precision). Again, the contribution to this im- 
provement is about the same for lexicalization 
and incomplete-data training. Applying the 
same evaluation to the Helbig-Buscha corpus 
shows 97.6% precision and 96.7% effectiveness 
for the lexicalized and selected incomplete- 
data model, compared to around 80% for the 
random baseline. 

Optimal iteration numbers were decided by repeated evaluation of the
models at every fifth iteration. Fig. 4 shows the precision of
lexicalized and selected models on the exact match task plotted against
the number of iterations of the training algorithm. For parsebank
training, the maximal precision value is obtained at 35 iterations.
Iterating further shows a clear overtraining effect. For
incomplete-data estimation more iterations are necessary to reach a
maximal precision value. A comparison of models with random or uniform
starting values shows an increase in precision of 10% to 40% for the
latter. In terms of maximization of likelihood, this corresponds to the
fact that uniform starting values immediately push the likelihood up to
nearly its final value, whereas random starting values yield an initial
likelihood which has to be increased by factors of 2 to 20 to an often
lower final value.

Figure 4: Precision on exact match task in number of training iterations
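Choosing the iteration number by evaluating at every fifth iteration can be sketched as a simple loop. This is a generic illustration rather than the procedure used in the experiments: `train_step` and `evaluate` are hypothetical callables standing in for one training iteration and for a precision measurement, and the toy score function is invented so that it peaks at iteration 35.

```python
def best_checkpoint(train_step, evaluate, params, max_iter, every=5):
    """Train for max_iter iterations and keep the best evaluated model.

    The model is evaluated before training and then at every `every`-th
    iteration; the best score, its iteration and its parameters are kept.
    """
    best_score, best_iter, best_params = evaluate(params), 0, params
    for it in range(1, max_iter + 1):
        params = train_step(params)
        if it % every == 0:
            score = evaluate(params)
            if score > best_score:
                best_score, best_iter, best_params = score, it, params
    return best_iter, best_score, best_params

# Toy stand-ins: the "model" is a counter and the score peaks at 35.
best_iter, best_score, _ = best_checkpoint(
    train_step=lambda p: p + 1,
    evaluate=lambda p: -(p - 35) ** 2,
    params=0,
    max_iter=60,
)
```

Keeping the best checkpoint instead of the last one is one way to guard against the overtraining effect described for further iterations.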

5 Discussion 

The most direct points of comparison of our method are the approaches
of Johnson et al. (1999) and Johnson and Riezler (2000). In the first
approach, log-linear models on LFG grammars using about 200
configurational properties were trained on treebanks of about 400
sentences by maximum pseudo-likelihood estimation. Precision was
evaluated on an exact match task in a 10-way cross validation paradigm
for an ambiguity rate of 10, and achieved 59% for the first approach.
Johnson and Riezler (2000) achieved a gain of 1% over this result by
including a class-based lexicalization. Our best models clearly
outperform these results, both in terms of precision relative to
ambiguity and in terms of relative gain due to lexicalization. A
comparison of performance is more difficult for the lexicalized PCFG of
Beil et al. (1999) which was trained by EM on 450,000 sentences of
German newspaper text. There, a 70.4% precision is reported on a verb
frame recognition task on 584 examples. However, the gain achieved by
Beil et al. (1999) due to grammar lexicalization is only 2%, compared to
about 10% in our case. A comparison is difficult also for most other
state-of-the-art PCFG-based statistical parsers, since different
training and test data, and most importantly, different evaluation
criteria were used. A comparison of the performance gain due to grammar
lexicalization shows that our results are on a par with those reported
in Charniak (1997).



6 Conclusion 

We have presented a new approach to stochas- 
tic modeling of constraint-based grammars. 
Our experimental results show that EM train- 
ing can in fact be very helpful for accu- 
rate stochastic modeling in natural language 
processing. We conjecture that this result 
is due partly to the fact that the space of 
parses produced by a constraint-based gram- 
mar is only "mildly incomplete", i.e. the 
ambiguity rate can be kept relatively low. 
Another reason may be that EM is espe- 
cially useful for log-linear models, where the 
search space in maximization can be kept 
under control. Furthermore, we have introduced
a new class-based grammar lexicalization, which
again uses EM training and incorporates a
pre-disambiguation routine into log-linear
models. An impressive gain in performance
could also be demonstrated for this
method. Clearly, a central task of future work
is a further exploration of the relation be- 
tween complete-data and incomplete-data es- 
timation for larger, manually disambiguated 
treebanks. An interesting question is whether 
a systematic variation of training data size 
along the lines of the EM-experiments of 
Nigam et al. (2000) for text classification will 
show similar results, namely a systematic
dependence of the relative gain due to EM
training on the relative sizes of unannotated
and annotated data. Furthermore, it is
important to show that EM-based methods can
be applied successfully also to other statistical 
parsing frameworks. 